SUMMARY


The recognition of speakers in an open-set [19], text-independent environment is
described. The recognition occurs without any prior training and in both noisy and clear
backgrounds in as little as 1.6 seconds. Investigations and testing were done in the areas
of feature characterization of speakers, pre-filtering of classifier input, and structure of
classifiers for recognition.

A feature-based speaker model was used consisting of Linear Prediction Coefficient
(LPC) Cepstrum, Reflection Coefficients, and Mel Cepstrum for classification, and
energy, pitch, and zero crossings for voiced/unvoiced decisions.

A  prefiltering  structure  for  speech  input  segments  using  an  expert  system  implementing 
hypothesize  and  test  for  relevance  was  investigated.  It  attempted  to  maximize 
classification  performance  by  pre-selection  of  most  likely  voiced  speech  segments  prior 
to  classification. 

The  classifier  used  was  based  on  ART  [3]  and  fuzzy  Min-Max  [25].  It  is  a  neural 
network  with  output  categories  represented  by  a  fuzzy  hypercube.  A  hypothesis  and  test 
is  performed  by  the  network  for  overlapping  categories  where  their  fuzzy  membership 
representations  are  interpreted  as  degrees  of  typicality,  rather  than  relative  [15].  For 
category  control  both  a  vigilance  test  and  overall  hypervolume  limit  test  are  used.  The 
hypercube limit is extended beyond the unit hypercube (as in [25]) to allow for more
“noisy”  feature  hypercubes.  The  network  has  7  layers:  input,  transform,  process, 
hypothesize,  test,  Actional,  and  category.  The  output  is  a  category  layer  represented  by 
a  fuzzy  feature  hypercube  for  each  created  class.  The  network  is  described  in  a  hybrid 
neuronal-functional  method. 

A  speaker  recognition  system  (based  on  [12,13])  was  tested  using  the  Switchboard  [27] 
and  Greenflag  [28]  data  bases.  Utterances  averaging  0.5  to  7.0  seconds  in  length  were 
tested,  with  over  5  hours  of  conversation  for  8  speaker  groups,  with  less  time  for  12  and 
16  speaker  groups.  The  fuzzy  hypercube  neural  network,  characterizing  one  speaker  per 
category,  produced  an  average  of  6.29  correct  and  0.29  incorrect  categories  out  of  a 
possible  8  total,  with  no  prior  training.  Overall  percent  correct  classification  was  found 
to  be  66.9%  average  for  8  speaker  groups. 
1. INTRODUCTION


The problem of text-independent speaker recognition has been of interest to many
investigators (see Peacocke [18] for an introduction and Atal [1] for some technical issues).
Markel  and  Davis  [17]  obtained  text-independent  speaker  recognition  results  of  98% 
correct  requiring  an  average  of  39  seconds  of  speech.  The  proper  choice  of  signal 
features  for  effective  speaker  recognition  is  a  major  issue  (see  Reynolds  [20],  Soong  [26], 
Pellisier  [19]). 

The  system  described  requires  recognition  to  be  made  with: 

•  Noisy  environment 

•  Average  of  3  seconds  of  speech 

• No prior learning/training

The  speakers  considered  were  taken  as  an  "open-set"  task  [19],  where  the  recognition 
system  had  to  classify  both  speakers  it  had  heard  and  not  heard  before.  Speaker 
recognition  involving  text  independent  information  in  an  open  set  environment  has  had 
limited  success  to  date  using  short-time  samples  [19]. 

This effort was concerned with recognizing an individual speaker's voice out of a set of
voices, in a text-independent and short-time environment. It involved two investigations:
first, generation of a descriptive set of voice features sufficient to characterize a speaker
in the problem environment, and second, formulation of a reliable classification without
any prior training, given the feature set based on voiced segments.

A  Speaker  Recognition  System  (SRS)  [12,13]  was  used  as  a  test  vehicle  which  accepted 
either analog or digitized voice signals, and produced a speaker characterization. Feature
processing developed a set of descriptive signal features which were classified into
speaker  classes.  This  report  develops  details  for  the  following  areas  of  the  SRS. 

•  Feature  Processing 

•  Classifier  Pre-Processing 

•  Neural  Network  Classifier 

•  Test  Results 

Fuzzy  ART 

The basic operation of “adaptive resonance” in the standard ART is carried over to the
fuzzy ART. The basic equations which govern the fuzzy ART are based on the equations
from the standard ART architecture where the intersection operator is replaced by its
fuzzy counterpart, the minimum operator. An introduction to the mathematics governing
the fuzzy ART is given here, based on Carpenter et al. [2, 3, 4, 5].

The fuzzy ART system consists of three layers: the input layer (F0), processing layer
(F1), and output category layer (F2). Associated with layers F1 and F2 is a set of
weights directed from F1 to F2. A fundamental difference between the fuzzy ART and
prior continuous versions is the simplification of the “resonance criteria” by use of only
bottom-up weights in the matching process. The matching process consists of two
matching operations:

• Degree to which input A matches output category C

• Degree to which category C matches input A

For the following, the norm of a vector A, which gives an indication of its “size,” is
defined as

$$|A| = \sum_{j} |a_j| \tag{1}$$

The following operations and data structures are associated with each of these layers:

Input Layer. Given an input vector A, $A = \{a_j\}$ or optionally, with the complement,

$$A = \{a_j, a_j^c\}, \quad j = 1, 2, \ldots, N_a \tag{2}$$

where $a_j^c = 1 - a_j$ is the complement of $a_j$.

The addition of the complement of the input vector has the advantage that A is now
self-normalized, using the definition of norm in Eq. 1:

$$|A| = |(a, a^c)| = \sum_{j=1}^{N_a} a_j + \sum_{j=1}^{N_a} (1 - a_j) = N_a \tag{3}$$

Output Layer. The output layer F2 consists of a set C of active categories,

$$C = \{C_j\}, \quad j = 1, 2, \ldots, N, \quad N \le N_{\max}$$

Each category vector $C_j \in C$ has an associated LTM weight set $W_j$.

Processing Layer. A category choice function $T_j$ measures the degree to which input A is
a match to category $C_j$ and its associated $W_j$:

$$T_j = \frac{|A \wedge W_j|}{\alpha + |W_j|} = \frac{|\mathrm{MIN}(A, W_j)|}{\alpha + |W_j|} \tag{4}$$

where $\alpha > 0$ is a choice parameter.

T is the best category choice, and is calculated as the union of all $T_j$:

$$T = \bigcup_j T_j = \mathrm{MAX}_j (T_j) \tag{5}$$

There  are  two  possible  cases  that  can  occur  once  a  category  choice  is  attempted: 

Case 1. Equation 5 produces a choice J. A test is performed on the preliminary choice J
to determine whether it meets a threshold criterion called the vigilance test, where the
degree to which the preliminary category matches the input A is compared against a
threshold $\rho$:

$$\frac{|\mathrm{MIN}(A, W_J)|}{|A|} \ge \rho \tag{6}$$

If the vigilance criterion of equation 6 is not met, the preliminary choice J is said to be
“reset,” and another category choice according to Eqs. 4 and 5 is made from the set of
active categories in C. If the vigilance criterion is met, then the system is said to be in a
state of resonance, and the input A is incorporated into category J by the following:

$$W_J^{new} = \beta \left(A \wedge W_J^{old}\right) + (1 - \beta) W_J^{old} \tag{7}$$

Fast learning is said to occur when $\beta = 1$.


Case 2. Equation 5 produces no choice. If no category choice can be made, a new
category $C_{N+1}$ is created with

$$W_{N+1}^{new} = A \tag{8}$$


Initialization:  N=0 

A  simplified  fuzzy  ART  architecture  is  described  by  Kasuba  [14]. 
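The fuzzy ART cycle of Eqs. 2 through 8 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the classifier used in the SRS: the class name `FuzzyART`, the helper `complement_code`, and the default values of `rho`, `alpha`, and `beta` are hypothetical, and the reset search is implemented by simply trying categories in decreasing order of the choice function.

```python
import numpy as np

def complement_code(a):
    """Eq. 2: append the complement 1 - a, so |A| equals the input dimension (Eq. 3)."""
    a = np.asarray(a, dtype=float)
    return np.concatenate([a, 1.0 - a])

class FuzzyART:
    """Illustrative fuzzy ART sketch; parameter defaults are assumptions."""

    def __init__(self, rho=0.75, alpha=0.001, beta=1.0):
        self.rho = rho        # vigilance threshold (Eq. 6)
        self.alpha = alpha    # choice parameter (Eq. 4)
        self.beta = beta      # learning rate; beta = 1 is fast learning (Eq. 7)
        self.weights = []     # one LTM weight vector W_j per category; N = 0 initially

    def present(self, a):
        """Present one input with components in [0, 1]; return its category index."""
        A = complement_code(a)
        # Eq. 4: choice function T_j for every active category.
        T = [np.minimum(A, w).sum() / (self.alpha + w.sum()) for w in self.weights]
        # Eq. 5 plus reset: try the best T_j first, then the next best, and so on.
        for j in np.argsort(T)[::-1]:
            w = self.weights[j]
            match = np.minimum(A, w).sum() / A.sum()          # Eq. 6
            if match >= self.rho:                             # resonance
                self.weights[j] = (self.beta * np.minimum(A, w)
                                   + (1.0 - self.beta) * w)   # Eq. 7
                return int(j)
        # Case 2: no category passes the vigilance test, so create one (Eq. 8).
        self.weights.append(A.copy())
        return len(self.weights) - 1
```

With `rho=0.8`, presenting `[0.1, 0.2]` and then `[0.12, 0.22]` places both inputs in category 0, while `[0.9, 0.9]` fails the vigilance test and opens category 1.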


2.  FEATURE  PROCESSING 

The speaker recognition system relies on the underlying model assumptions on which it is
based. In this case our model is a heuristic one which loosely follows the Linear
Predictive Coefficients (LPC), but includes other features to add fidelity and descriptive
power to the system. Prior works in characterizing speaker features have
been  numerous.  Atal  [1]  identified  spectral  information  and  cepstrum  parameters  for 
Automated  Speaker  Recognition  (ASR).  Columbi  [9]  provides  an  overview  for  both 
speaker  and  listener  feature  models.  Other  models  are  the  RASTA/PLP  [11].  Soong  et 
al.  [26]  investigated  transitional  spectral  features  and  stated  “instantaneous  spectral 
features  carry  more  speaker  relevant  information  than  transitional  in  ASR.”  Reynolds 
[20]  investigated  several  features  and  widths,  and  reported  “simple  cepstral  mean  removal 
was  the  best  channel  compensation  technique  for  all  features”  (he  tested).  Pellisier  [19] 
specifically  investigated  features  in  the  open  set  recognition  case.  He  reported  that 
liftered  LPC  cepstral  with  normalized  log  energy  appended  are  optimal  for  the  TIMIT 
corpus,  and  LPC  reflection  with  normalized  log  energy  are  optimal  for  the  tactical 
GREENFLAG  corpus.  “In  general,  LPC  Cepstrum  appended  or  not,  perform  well.” 
Additionally,  [19]  found  that  transitional  features  did  not  perform  as  well  as  static 
features,  and  that  decision  fusion  techniques  are  the  best  means  of  capitalizing  on  the 
temporal  information.  Mel  frequency  cepstrum  also  performed  well,  but  not  as  well  as 
LPC  cepstrum  and  reflection  coefficients. 

The  characterization  of  a  speaker's  voice  signal  into  representative  features  can  be  broken 
into  several  basic  phases  of  processing:  Signal  Conversion  and  Formatting,  Signal 
Segmentation,  and  Feature  Processing. 
2.1 Signal Conversion and Formatting

Voice  signal  is  converted  to  a  digital  signal  representation  for  the  next  stage  of 
segmentation  processing.  Raw  voice  signal  can  be  captured  by  a  microphone,  from  a 
receiver  detector,  or  other  transducer,  as  well  as  being  provided  by  a  digitized  database. 
This  signal  is  amplified  or  attenuated,  and  applied  to  an  A/D  converter. 

The  NIST  SPHERE  speech  format  standard  was  used  for  control  of  the  voice  signal  data 
used  in  the  investigation.  In  the  case  of  digitized  database  packages,  the  NIST/SPHERE 
voice  representation  was  used  as  an  interface  standard.  Additionally,  it  provides 
conversion  information,  such  as  Analog/Digital  rates. 


2.2  Signal  Segmentation 

Signal  Segmentation  consists  of  processing  the  digital  signal  to  determine  suitability  for 
the actual feature space representation and processing. This is accomplished through two
separate operations on the signal segments: time segmentation and voiced/unvoiced
signal set partition.

2.2.1  Time  Segmentation 

Time segmentation of the input signal develops a basis for segment-to-segment
processing and averaging over many segments. The segment length is taken from FFT
requirements and the sampling rates for the Analog/Digital converter. Speech segments
in the range 20-50 ms are created for processing by the system one segment at a time.
The segments can overlap by 0-100% of the signal, and tests using different overlaps
were performed. A more detailed view of the segmentation process is seen in figure 1. A
series of definitions is given in terms of signal processing.


Figure 1. Time Segmentation of Speech Signal (signal labels V1, V2, V3)

Signal. Time function resulting from the signal conversion and formatting operation of
section 2.1. The unprocessed signal V has the basic characteristics of being non-periodic,
bounded, energy limited, duration limited, and band limited:

$$V(T) = \{x;\ x(t + T) \ne x(t),\ -\infty < t < \infty\} \quad \text{[non-periodic]}$$

$$V(K) = \{x;\ |x(t)| < K,\ -\infty < t < \infty\} \quad \text{[bounded]}$$

$$V(K) = \left\{x;\ \int_{-\infty}^{\infty} x^2(t)\,dt < K\right\} \quad \text{[energy-limited]}$$

$$V(T) = \{x;\ x(t) = 0 \text{ for all } |t| > T\} \quad \text{[duration-limited]}$$

$$V(W) = \left\{x;\ X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2 \pi f t}\,dt = 0 \text{ for all } |f| > W\right\} \quad \text{[band-limited]}$$

where K is a positive real number, T is a period, and W is the frequency band.


Burst Signal. A burst $B_i$ of a signal $V_j$ is a consecutive individual duration-limited
segment of a signal, in a series of one or more non-overlapping segments. The signal set
V contains all the signals over a time period of interest.

$$V = \bigcup_j V_j, \quad V_j = \bigcup_i B_i, \quad \text{where } j = 1, 2, \ldots, N_v \text{ and } i = 1, 2, \ldots, N_b \tag{9}$$

$$B_{i_1} \cap B_{i_2} = \emptyset \quad \text{for } i_1 \ne i_2, \text{ over all } i$$

$N_v$ is the number of signals, and $N_b$ is the number of bursts in signal j. All the $V_j$ are assumed
to be independent. Each burst is composed of a series of non-overlapping segments. The
following criteria on the bursts hold:

a) The segments $S_k$ of the burst $B_i$ are predominantly from the set of voiced
segments.

b) The burst length is limited to a maximum value $T_{max}$.

A relation bounding the number of segments in any burst i is defined as

$$N_{seg}^{i} \, T_{seg} \le T_{max} \quad \text{for all } i = 1, 2, \ldots, N_b \tag{10}$$

where $T_{seg}$ is the segment constant time, and $N_{seg}^{i}$ is the number of segments in burst i.


Segment Overlap. Each of the segments in burst $B_i$ is said to uniformly overlap if
each consecutive segment has jj samples of signal in common with the previous
segment. If the number of samples in a segment is $N_{sam}$, we have the following relation
for the degree of overlap, $D_{ov}$:

$$D_{ov} = \frac{jj}{N_{sam}} \tag{11}$$

The SRS testing varied $D_{ov}$ from 0-50%.
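As an illustration of the time segmentation and overlap relation of Eq. 11, the following sketch cuts a sampled signal into fixed-length segments that share a given fraction of their samples. It is a hedged example only: the function name `segment_signal` and its parameters are hypothetical, the 30 ms default is merely one value inside the 20-50 ms range stated above, and trailing samples that do not fill a whole segment are dropped by assumption.

```python
import numpy as np

def segment_signal(x, fs, seg_ms=30.0, overlap=0.5):
    """Split x into uniformly overlapping segments.

    overlap is the degree of overlap D_ov = jj / N_sam from Eq. 11 (0 <= overlap < 1).
    Returns an array of shape (number_of_segments, N_sam).
    """
    x = np.asarray(x, dtype=float)
    n_sam = int(round(fs * seg_ms / 1000.0))   # samples per segment, N_sam
    jj = int(round(overlap * n_sam))           # samples shared with the previous segment
    hop = n_sam - jj                           # advance between segment starts
    if hop <= 0:
        raise ValueError("overlap must be smaller than 100% in this sketch")
    if len(x) < n_sam:
        return np.empty((0, n_sam))
    n_seg = 1 + (len(x) - n_sam) // hop        # incomplete tail segment is dropped
    return np.stack([x[k * hop : k * hop + n_sam] for k in range(n_seg)])
```

For one second of speech sampled at 8 kHz, 30 ms segments with 50% overlap give 65 segments of 240 samples, and the last 120 samples of each segment reappear as the first 120 samples of the next.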
2.2.2 Voiced/Unvoiced Signal Set Partition

A voiced/unvoiced partition of the signal segment set is made through an algorithm based
on [1]. This set partition is made using elementary signal features such as average zero
crossings [22], average pitch [21], and average log energy [22]. The methods to develop
each are described below.

Pitch. The Average Magnitude Difference Function (AMDF) [21] is used for pitch
extraction. It is a variation on autocorrelation analysis where, instead of correlating the
input speech at various delays, a difference signal is formed between the delayed speech
and the original. At each delay, the absolute magnitude of the difference is taken. At
delay = 0 the difference signal is always zero, and for voiced sounds it exhibits deep
valleys at delays corresponding to the pitch period. The AMDF pitch extractor was
chosen because it gives a good estimate of the pitch contour and requires no multiply
operations, unlike the autocorrelation method, thus improving efficiency. The following is
the AMDF algorithm for extracting the pitch period per segment of speech:


Step 1. Using the difference relation in Eq. 12, find the AMDF for delay n >= 0,

$$\mathrm{AMDF}(n) = \frac{1}{N'} \sum_{k} |s_k - s_{k-n}| \tag{12}$$

where n = 0, 1, ..., $N_{sam}$,
N' = number of samples in the subset of the chunk (the sum runs over these N' samples),
$s_k$ is a sample from the original signal,
$s_{k-n}$ represents a sample from the signal delayed by n.

$$N' = N_{sam} \times 0.75$$

That is, 75% of the samples were used in the AMDF correlation of Eq. 12.

Step 2. From the AMDF find the first pitch valley where n > 0. The delay at the point of
the valley is the pitch period. The inverse of the pitch period is the pitch P of the voiced
speech in frequency.
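The two AMDF steps can be sketched as follows. This is an illustrative reading of the procedure, not the SRS code: the function names are hypothetical, the maximum delay is limited to the 25% of the segment left over after taking N' = 0.75 N_sam samples (so that every delayed sample exists), and the `depth` threshold used to decide what counts as a "valley" is an assumption introduced here.

```python
import numpy as np

def amdf(seg, frac=0.75):
    """AMDF of one segment for delays n = 0 .. N_sam - N', using N' = frac * N_sam samples."""
    s = np.asarray(seg, dtype=float)
    n_prime = int(len(s) * frac)            # N' samples in the comparison window
    max_lag = len(s) - n_prime              # largest delay with all samples available
    k = np.arange(max_lag, max_lag + n_prime)
    return np.array([np.mean(np.abs(s[k] - s[k - n])) for n in range(max_lag + 1)])

def amdf_pitch(seg, fs, frac=0.75, depth=0.3):
    """Return the pitch in Hz from the first deep AMDF valley with n > 0."""
    d = amdf(seg, frac)
    limit = depth * d.max()                 # assumed criterion for a "deep" valley
    for n in range(1, len(d) - 1):
        if d[n] < limit and d[n] < d[n - 1] and d[n] <= d[n + 1]:
            return fs / n                   # pitch = 1 / pitch period
    # Fallback (assumption): no clear valley, use the global minimum over n > 0.
    return fs / (1 + int(np.argmin(d[1:])))
```

A 50 ms segment of a 200 Hz tone sampled at 8 kHz has its first valley at a delay of 40 samples, so the sketch returns 200 Hz. With these settings the lowest detectable pitch is fs divided by the maximum delay, which is 80 Hz for a 400-sample segment.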


Average Zero Crossings. The average zero crossings measure is determined from the number of
sign changes in a signal segment over time. A count C is made over the entire segment
length T by counting the number of times the following occurs between each sample x(n)
and sample x(n-1) in the segment:

$$\mathrm{sign}[x(n)] \ne \mathrm{sign}[x(n - 1)] \tag{13}$$

The average zero crossing $n_z$ is equal to

$$n_z = \frac{1}{2T} C \tag{14}$$


Since the energy of the voiced speech signal is concentrated below 3 kHz, and the energy of
fricatives is generally above 3 kHz, zero crossing information can be used as a feature in
voiced/unvoiced speech characterization [21].

Average Log Energy is another signal measure used for voiced/unvoiced detection. It is
computed on each speech segment. The energy calculation is given by [22]

$$E_{log} = 10 \log_{10} \left( \frac{1}{N_{sam}} \sum_{i=1}^{N_{sam}} s_i^2 \right) \tag{15}$$


Voiced/Unvoiced Rule. The voiced/unvoiced characterization of a single segment is a
majority-based decision using criteria of the pitch, zero crossings, and Average Log
Energy of each input segment. The following algorithm was used:

Voiced/Unvoiced Algorithm: Given: P, n_z, E_log for a segment

Step 1: M = 0

Step 2: IF n_min < n_z < n_max THEN M = M + 1

Step 3: IF E_log > E_min THEN M = M + 1

Step 4: CASE M

2: “Voiced”

0: “Unvoiced”

1: IF P_min <= P <= P_max “Voiced”

ELSE “Unvoiced”

where

P_min, P_max are the minimum and maximum pitch, 30-500 Hz
n_min, n_max are the minimum (30) and maximum (3000) zero crossing frequency
E_min is the minimum voiced energy threshold

If the majority of the tests are true then the speech segment is assumed to be voiced.
Otherwise the segment is assumed unvoiced and is discarded.
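The zero crossing measure, the log energy, and the majority rule can be put together in a short sketch. It is illustrative only: the function names are hypothetical, the energy threshold `e_min` has no value in the text and must be supplied by the caller, samples that are exactly zero are treated as positive, and a small constant is added inside the logarithm to avoid taking the log of zero.

```python
import numpy as np

def zero_crossing_frequency(seg, fs):
    """Count sign changes C over the segment and return C / (2T) in Hz."""
    s = np.asarray(seg, dtype=float)
    signs = np.sign(s)
    signs[signs == 0] = 1.0                        # assumption: zero counts as positive
    c = int(np.count_nonzero(signs[1:] != signs[:-1]))
    t = len(s) / fs                                # segment length T in seconds
    return c / (2.0 * t)

def log_energy(seg):
    """Average log energy of a segment in dB (mean square, floor added for silence)."""
    s = np.asarray(seg, dtype=float)
    return 10.0 * np.log10(np.mean(s ** 2) + 1e-12)

def is_voiced(pitch, n_z, e_log, e_min,
              pitch_range=(30.0, 500.0), zc_range=(30.0, 3000.0)):
    """Majority vote: zero crossing and energy tests first, pitch breaks a tie."""
    m = 0
    if zc_range[0] < n_z < zc_range[1]:
        m += 1
    if e_log > e_min:
        m += 1
    if m == 2:
        return True
    if m == 0:
        return False
    return pitch_range[0] <= pitch <= pitch_range[1]
```

A segment that passes only one of the first two tests is kept as voiced when its pitch lies in the 30-500 Hz range, and is discarded otherwise.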


2.3  Signal  Feature  Generation 

The  feature  processing  calculates  various  signal  transform  features  which  represent 
different  characterizations  of  a  speaker  through  his  voice  signal.  Linear  Prediction  Coding 
finds  the  coefficients  from  the  Inverse  Filter,  A(z),  defined  by  Markel  [16].  The 
significance  of  the  Inverse  Filter  is  that  it  can  realize  a  model  of  the  physical  speech 
production  system  such  as  the  Glottal  G(z),  the  Vocal  Tract  V(z)  and  the  Lip  Radiation 
L(z)  system  [3]. 

$$A(z) = 1 + \sum_{i=1}^{p} a_i z^{-i} = \frac{1}{G(z)\, V(z)\, L(z)} \tag{16}$$

The  signal  feature  processing  is  performed  in  three  consecutive  phases:  a)  LPC  Analysis, 
b)  Mel  Cepstrum  Calculation  and  c)  Feature  Scaling. 
2.3.1 LPC Analysis

The LPC analysis consisted of setting filter constants and initialization parameters,
followed by Pre-Emphasis, Hamming Window, Autocorrelation, Durbin expansion
(LPC/autocorrelation and Reflection Coefficients), LPC Cepstrum, and Delta Cepstrum.

2.3.1.1 Pre-Emphasis:

A given segment of speech is pre-emphasized by the following function.

$$w(i) = s(i) - 0.98\, s(i - 1), \quad i = 1, 2, \ldots, N_{sam} \tag{17}$$

where s(i) is a sample in a segment.


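2.3.1.2 Hamming Window:

The use of a Hamming window reduces effects of oscillations and poor convergence.

$$w(n) = \begin{cases} 0.54 - 0.46 \cos\left(\dfrac{2 \pi n}{N_{sam} - 1}\right), & 0 \le n \le N_{sam} - 1 \\ 0, & \text{otherwise} \end{cases} \tag{18}$$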
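Pre-emphasis and windowing are simple enough to show directly. The sketch below assumes a pre-emphasis coefficient of 0.98 as read from Eq. 17 and leaves the first sample of the segment unchanged, since no earlier sample is available inside the segment; the function names are hypothetical.

```python
import numpy as np

def pre_emphasize(s, mu=0.98):
    """Eq. 17: w(i) = s(i) - mu * s(i - 1); the first sample is passed through."""
    s = np.asarray(s, dtype=float)
    w = s.copy()
    w[1:] = s[1:] - mu * s[:-1]
    return w

def hamming(n_sam):
    """Eq. 18: Hamming window of n_sam points."""
    n = np.arange(n_sam)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (n_sam - 1))

def condition_segment(seg):
    """Pre-emphasize, then window, one speech segment before autocorrelation."""
    seg = pre_emphasize(seg)
    return seg * hamming(len(seg))
```

The window defined here matches `numpy.hamming`, which could be used instead.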


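2.3.1.3 Autocorrelation Coefficients:

The autocorrelation coefficients C(i) are determined by:

$$C(i) = \frac{1}{N_{sam}} \sum_{j=0}^{N_{sam}-1-i} s(j)\, s(j + i) \quad \text{for } i = 1, \ldots, O_{cor} \tag{19a}$$

normalizing,

$$C(i) = \frac{C(i)}{C(0)} \quad \text{for } i = 1, \ldots, O_{cor} \tag{19b}$$

where $O_{cor}$ is the correlation order.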


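2.3.1.4 Durbin Expansion:

Function to compute LPC parameters with Durbin's formula. The LPC and reflection
coefficients are calculated using the autocorrelation coefficients C(i). The listing
follows the standard Levinson-Durbin recursion, with the coefficient array indexed from 1.

Durbin's formula:

1. Initialization

r_1 = -C_1 / C_0
lpc_1 = 1.0
lpc_2 = r_1
α = C_0 [1 - lpc_2^2]

2. Algorithm

DO FOR i = 2 TO O_lpc
    s = SUM over j = 1 TO i of C_(i-j+1) * lpc_j
    r_i = -s / α
    FOR j = 2 TO i/2 + 1
        k = i - j + 2
        A_j = lpc_j + r_i * lpc_k
        lpc_k = lpc_k + r_i * lpc_j
        lpc_j = A_j
    lpc_(i+1) = r_i
    α = α [1 - r_i^2]

FOR j = 1 TO O_lpc
    lpc_j = -lpc_(j+1)

lpc_0 = 1.0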
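For reference, the Levinson-Durbin recursion can be written compactly with zero-based arrays. This is a sketch of the textbook recursion, not a transcription of the SRS routine: the function name is hypothetical, the returned `a` holds the inverse filter coefficients of A(z) = 1 + sum a[i] z^-i, and `k` holds the reflection coefficients. Negating `a[1:]` gives predictor coefficients in the sign convention of the final loop above.

```python
import numpy as np

def levinson_durbin(c, order):
    """Solve the autocorrelation normal equations for an all-pole model.

    c      autocorrelation values c[0] .. c[order]
    return (a, k, err): inverse filter coefficients with a[0] = 1,
           reflection coefficients, and the final prediction error.
    """
    c = np.asarray(c, dtype=float)
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    err = c[0]
    for i in range(1, order + 1):
        # Correlation of the current filter with the autocorrelation sequence.
        acc = c[i] + np.dot(a[1:i], c[i - 1:0:-1])
        k_i = -acc / err
        a_prev = a.copy()
        # Update the interior coefficients using the reversed previous solution.
        a[1:i] = a_prev[1:i] + k_i * a_prev[i - 1:0:-1]
        a[i] = k_i
        k[i - 1] = k_i
        err *= (1.0 - k_i ** 2)
    return a, k, err
```

For the autocorrelation sequence 1, 0.5, 0.25 of a first-order process, the recursion returns a = [1, -0.5, 0], so the second reflection coefficient is zero and the prediction error is 0.75.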


2.3.1.5  LPC  Cepstrum 

The Cepstrum [9] is, by definition, the inverse Fourier transform of the logarithm of the
transfer function (the definition also used by Atal [1]). The Cepstral Coefficients were
obtained directly from the LPC coefficients.

$$\ln H(z) = C(z) = \sum_{k=1}^{\infty} c_k z^{-k}$$

The all-pole filter model based on predictive analysis on speech samples is

$$H(z) = \frac{1}{1 - \sum_{k=1}^{p} lpc_k z^{-k}}$$

It can be shown that, given the all-pole model, a recursive relation exists between the
cepstral coefficients $c_k$ and the predictor coefficients $lpc_k$:

$$c_1 = lpc_1$$

$$c_k = \sum_{j=1}^{k-1} \left(1 - \frac{j}{k}\right) lpc_j\, c_{k-j} + lpc_k, \quad 1 < k \le p \tag{20}$$

The sampling frequency $f_s$ determines the number of poles p, modified by a fudge factor $\gamma$:

$$p = \frac{f_s}{1000} + \gamma, \quad \gamma \approx 3$$
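The recursion of Eq. 20 translates directly into code. The sketch below assumes the predictor-coefficient sign convention H(z) = 1 / (1 - sum lpc_k z^-k) and computes at most p cepstral coefficients; the function name and the `n_cep` argument are hypothetical.

```python
import numpy as np

def lpc_to_cepstrum(lpc, n_cep=None):
    """Convert predictor coefficients lpc_1 .. lpc_p into cepstral coefficients c_1 .. c_n.

    lpc[k - 1] holds lpc_k. n_cep defaults to p and may not exceed it in this sketch.
    """
    lpc = np.asarray(lpc, dtype=float)
    p = len(lpc)
    n_cep = p if n_cep is None else n_cep
    if n_cep > p:
        raise ValueError("this sketch only covers 1 <= k <= p")
    c = np.zeros(n_cep)
    for k in range(1, n_cep + 1):
        acc = lpc[k - 1]                       # the lpc_k term of Eq. 20
        for j in range(1, k):
            acc += (1.0 - j / k) * lpc[j - 1] * c[k - j - 1]
        c[k - 1] = acc
    return c
```

As a check, a single pole at z = 0.5 has ln H(z) = sum (0.5^k / k) z^-k, and `lpc_to_cepstrum([0.5, 0.0, 0.0])` reproduces the first three terms of that series.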


2.3.1.6 Delta Cepstrum (from [26])

Given $c_m$ and $c'_m$, the cepstral representations of two bursts, the delta cepstrum is found
for the first p cepstral coefficients:

$$d_{CEP}^2 = \sum_{m=1}^{p} (c_m - c'_m)^2 \tag{21a}$$

In order to equalize the contributions from individual cepstral components, a weighted
cepstral distance is desirable. Using the Mahalanobis distance, and since the estimated
covariance matrix is essentially diagonal, we obtain:

$$d_{WCEP}^2 = \sum_{m=1}^{p} w_m (c_m - c'_m)^2 \tag{21b}$$

where the weighting coefficient $w_m$ is the reciprocal of the variance of the mth cepstral
coefficient.

The generalized slope in time has the following form:

$$\Delta c_m(t) = \frac{\sum_{k=-K}^{K} k\, c_m(t + k)}{\sum_{k=-K}^{K} k^2} \tag{21c}$$

2.3.2  Mel  Cepstral  Feature 

Linear prediction cepstral coefficients generated from the LP spectrum and distributed
along a linear frequency axis form a less than optimal representation of an auditory
signal, since a logarithmic function of frequency better approximates the ability of the
human ear to discriminate frequencies. The Mel scale is often used to approximate the
resolution of the human auditory system's perception of speech. Deller et al. define the
Mel as "a unit measure of perceived pitch or frequency of a tone." An equation for
approximating the Mel scale is:

$$F_{mel} = \frac{1000}{\log(2)} \log\left(1 + \frac{F_{Hz}}{1000}\right)$$

The Mel frequency cepstral coefficients (MFCC) are obtained by Mel warping the
spectrum's frequency scale before taking the fast Fourier transform:

Mel cepstrum = FFT(log|Mel spectrum|)

Development  of  Mel  Cepstrum.  The  Mel  cepstral  coefficients  are  generated  by  the 
following  procedure: 

1. Calculate the Mel Bands: The Mel bands are calculated from the number of Mel
bands, and the start and end frequencies, $f_{start}$ and $f_{end}$:

$$f^{mel}_{start,end} = 2595 \log_{10}\left(1 + \frac{f_{start,end}}{700}\right)$$

$$step = \frac{f^{mel}_{end} - f^{mel}_{start}}{N_{bands}}$$

Each band is a multiple n of the step in Mel frequency and is calculated by:

$$band_n = \left(10^{\,n \cdot step / 2595} - 1\right) \times 700$$

Each band is converted to the integer value of the sample to which it corresponds:

$$band^{int}_n = \mathrm{int}\left[ band_n \cdot \frac{N_{fft}}{f_s} \right]$$

2. Weight Bands: A square filter is used to weight the bands. It is an all-pass function for
each of the Mel bands.

3. Preemphasize: See section 2.3.1.1 above.

4. Hamming Window: See section 2.3.1.2 above.

5. Fast Fourier Transform (FFT): The FFT is computed for the discrete signal with $N_{fft}$ points,
which is a power of 2, and produces the discrete Fourier transform $dft_n$.

6. Weighted cepstral: The magnitude of the DFT is weighted by the appropriate weight
for the band and the logarithm taken to form the cepstrum.

$$c_i = \log\left( \frac{w_i \sum_{n = band^{int}_i}^{band^{int}_{i+1}} |dft_n|}{band^{int}_{i+1} - band^{int}_i} \right) \tag{22}$$


7. Discrete Cosine Transform (DCT): The DCT is performed on the weighted cepstral
components to obtain the final result.

$$Ct_l = \sum_{m=0}^{N_{filter}} c_m \cos\left( \frac{l \pi (m - 0.5)}{N_{filter}} \right), \quad l = 0, 1, 2, \ldots, M \tag{23}$$
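The Mel band calculation of step 1 can be illustrated as below. The constants 2595 and 700 come from the text; everything else is an assumption made for the sketch: the function names are hypothetical, the band edges are offset by the start frequency in Mels (which reduces to a plain multiple of the step when the start frequency is 0 Hz), and the conversion to integer sample positions rounds frequency times FFT size over sampling rate to the nearest bin.

```python
import numpy as np

def hz_to_mel(f):
    """Mel value of a frequency in Hz (2595 log10(1 + f / 700))."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_band_edges(f_start, f_end, n_bands):
    """Band edges in Hz, equally spaced on the Mel scale between f_start and f_end."""
    m_start, m_end = hz_to_mel(f_start), hz_to_mel(f_end)
    step = (m_end - m_start) / n_bands
    return mel_to_hz(m_start + step * np.arange(n_bands + 1))

def edges_to_bins(edges_hz, n_fft, fs):
    """Map band edges in Hz to the nearest DFT sample index (assumed mapping)."""
    return np.rint(np.asarray(edges_hz) * n_fft / fs).astype(int)
```

For 13 bands between 0 and 4000 Hz, the edges run from 0 Hz to 4000 Hz with widths that grow with frequency, which is the intended effect of the Mel warping.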


2.3.3  Feature  Averaging 

An averaging of each of the features was done. Each individual feature is averaged over
the $N_{seg}$ segments,

$$\bar{f}_j^{\,unscaled} = \frac{1}{N_{seg}} \sum_{i=1}^{N_{seg}} f_{j,i}^{\,unscaled} \tag{24a}$$


2.4  Feature  Selection 

The  feature  sets  selected  for  final  implementation  for  speaker  recognition  were  based  on 
the  results  of  Pellissier  [19].  The  set  utilized  was 


• LPC Cepstral (7 coefficients), defined by equation (20).

• Reflection coefficients (12 coefficients), defined by the Durbin expansion in 2.3.1.4.

• Mel Cepstral (13 coefficients), defined by equation (22).

Additional  features  considered  during  testing: 

•  Delta  Cepstrum 

•  Pitch 

•  Energy 

•  Listener  Model 
3. CLASSIFIER PREPROCESSING


A  series  of  experiments  was  performed  to  assess  the  usefulness  of  preprocessing  speech 
feature  data  for  the  classifier.  The  overall  structure  for  preprocessing  is  a  test  structure 
which  develops  a  set  of  short-term  hypotheses  about  the  current  signal  and  tests  to 
determine  which  segments  of  the  signal  should  be  passed  on  to  the  actual  classifier  or  to 
long-term  hypothesis  memory.  A  block  diagram  of  the  preprocessing  scheme  is  shown  in 
figure  2. 


Figure 2. Classifier Preprocessing System

The  hypothesis  and  test  paradigm  was  investigated  to  select  information  to  be  learned  by 
a  Neural  Network  Classifier  and  reject  information  that  was  unsuitable.  The  criteria  of 
the  selection  are  made  on  the  basis  of  the  intersegment  global  information  structure.  The 
segment  data  are  rated  according  to  their: 

1)  overall  rating  similarity 

2)  grouping  of  like  versus  unlike  segments  in  time. 

The overall rating similarity was determined from class-average results of the preclassification
process, i.e., for each potential class i, an average of the result was given:

$$avg_i = \frac{a(i)}{\sum_k a(k)}$$

For the grouping of like terms, all segments in a “group” of segments, which
is a related unit, are compared in their time relationship to each other. Thus, if two
segments next to each other are of like pre-class, the linkage is strong, whereas if two
segments are separated by an unlike segment, they have less linkage, and so on. A
directed graph of relations was created and used to rate linkage strength.
The incoming signal was tested by subjecting it to a series of classifications that are
stored in short-term memory (STM). After the series is completed, the contents of the
STM  are  tested  according  to  the  grouping  and  linkage  criteria  using  an  expert  system. 
The  results  of  the  test  determine  if  the  STM  contents  are  allowed  to  retrain  the  long-term 
memory  (LTM).  Each  data  point  is  retrained  and  if  accepted  by  the  test,  reclassified,  and 
finally  stored.  Additionally,  an  optional  periodic  retraining  of  long  term  memory  using 
all  accepted  signals  over  a  finite  period  is  done  to  eliminate  any  long  term  averaging 
effects  on  the  individual  speaker  signatures.  An  optional  output  of  each  result  of  the 
current  hypothesis  is  available  for  further  processing. 


4.  NEURAL  NETWORK  CLASSIFIER 

The recognition of a speaker from a set of features requires a clustering/classification
process which is able to form any number of classes dynamically, and tolerate the noisy
and overlapping domain of speaker feature vectors. In this effort, an ART model [3-7] was
used  to  cluster  and  classify  unknown  speakers.  There  were  three  networks  considered 
during  this  investigation:  ART  [3,10,12],  fuzzy  ART  [7,11],  and  fuzzy  hypercube  ART 
[25].  After  some  preliminary  testing  of  all  three  networks,  emphasis  was  placed  on 
modification  of  fuzzy  ART  neural  network  architecture  for  speaker  recognition. 


4.1  Basic  ART2  Neural  Net  [13] 

The  general  operation  of  the  basic  ART2  neural  network  architecture  is  described.  This 
forms  the  basis  for  the  fuzzy  ART  and  fuzzy  hypercube  ART  networks.  A  typical  ART2 
neural  network  is  composed  of  two  layers  of  fully  interconnected  neurons.  Adaptive 
connections  between  neurons  store  long  term  memory  (LTM)  traces  in  the  network.  LTM 
represents information that the network has learned. Figure 3 shows the basic architecture of the
ART2 neural network.

Figure 3. ART2 neural network architecture


The two layers (or fields) of neurons in the ART2 architecture in figure 3 form the
Attentional Subsystem. The first field is named the Feature Representation Field, or F1.
Each F1 neuron contains processing elements that form three intra-PE sublayers which are
responsible for processing one element in the input pattern.

The main function of the feature representation field is to enhance the current input
pattern's  salient  features  while  suppressing  noise  [23].  This  is  achieved  through  pattern 
normalization  and  thresholding  which  are  required  for  the  processing  of  analog  patterns. 
Normalization  compares  the  input  pattern  and  the  patterns  stored  in  the  network's  LTM 
traces.  Thresholding  maps  the  infinite  domain  of  the  input  patterns  to  a  prescribed  range 
[23].  The  second  layer  in  the  attentional  subsystem  is  called  the  Category  Representation 
Field,  or  F2.  Each  neuron  in  this  field  represents  one  category  (or  class)  that  has  been 
learned  by  the  network.  The  connections  from  a  particular  F2  neuron  store  the  pattern  of 
the  category  it  represents. 

ART2 utilizes an unsupervised competitive learning technique in which patterns are
represented by points in an N-dimensional feature space. Pattern similarity is assessed on
the basis of Euclidean distance: patterns that are sufficiently close to
one another are placed in the same category.

The  N-dimensional  centroid  location  represents  that  class'  exemplar.  An  unsupervised 
learning  procedure  attempts  to  discover  the  distributions  and  centroids  of  the  categories 
for  the  patterns  it  is  presented. 

ART2 utilizes a "winner-take-all" classification strategy, such as MAXNET, that operates
in the following manner:

(1) An input pattern is presented to the feature representation field, where it is
normalized and thresholded.

(2) The resultant signal, which is called short term memory (STM), is passed
through bottom-up connections to the category representation field.

(3) Each established class in F2 responds to the signal with an activation level,
which it sends to itself through excitatory connections and to all its neighbors
through inhibitory connections.

(4) Eventually the F2 neuron with the highest activation will inhibit the others.
The sole remaining active F2 neuron is assumed to most resemble the current
input pattern.
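
The competition in steps (3) and (4) can be illustrated with a small MAXNET-style
iteration. This is only a sketch under stated assumptions: the function name `maxnet`, the
inhibition weight `epsilon`, and the iteration cap are illustrative choices and are not
taken from the report.

```python
import numpy as np

def maxnet(activations, epsilon=None, max_iters=1000):
    """Return the index of the winning node after mutual inhibition.

    Each node keeps its own activation (self-excitation) and subtracts
    epsilon times the summed activation of all other nodes (inhibition).
    """
    a = np.clip(np.asarray(activations, dtype=float), 0.0, None)
    n = a.size
    if epsilon is None:
        epsilon = 1.0 / (2.0 * n)  # assumed value; must stay below 1/n
    for _ in range(max_iters):
        if np.count_nonzero(a) <= 1:
            break  # only one node is still active
        total = a.sum()
        a = np.maximum(0.0, a - epsilon * (total - a))
    return int(np.argmax(a))
```

For example, `maxnet([0.2, 0.9, 0.5])` leaves only the second node active.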

Having selected the winner, the Orienting Subsystem is activated and determines whether
the winning neuron's LTM traces sufficiently resemble the STM pattern to be considered
a match. The degree of match between the two patterns is related to the cosine of the
angle between them in feature space. Patterns that are very similar are nearly parallel to
each other, while dissimilar patterns are nearly orthogonal to each other. A matching
threshold called the Vigilance Parameter determines how similar the input pattern must be
to the exemplar to be considered a match [24]. If the degree of match computed by the
orienting subsystem exceeds the vigilance parameter, a state of resonance is attained and
the STM pattern at F1 is merged onto the winning neuron's LTM traces. Otherwise, the
orienting subsystem sends a reset signal to the winning neuron and inhibits it from
competing again for the current input pattern [23]. This search process is repeated until
either an F2 neuron passes the vigilance test or all established F2 neurons have failed the
test. In the latter case, a new category is established in the next available F2 neuron.
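
The search cycle described above can be sketched as follows. This is a simplified
illustration, not the full ART2 dynamics: it assumes the bottom-up activation is a plain dot
product and the degree of match is the cosine of the angle between the input and the stored
exemplar. The names `art_search` and `cosine` are hypothetical.

```python
import numpy as np

def cosine(x, w):
    """Cosine of the angle between two vectors (0.0 if either is all zero)."""
    denom = np.linalg.norm(x) * np.linalg.norm(w)
    return float(np.dot(x, w) / denom) if denom > 0.0 else 0.0

def art_search(x, exemplars, vigilance):
    """Return (category_index, is_new) for input x.

    Categories are tried in order of decreasing activation. A category that
    fails the vigilance test is "reset" (skipped) for this input. If every
    category fails, the index of the next free category is returned.
    """
    x = np.asarray(x, dtype=float)
    ws = [np.asarray(w, dtype=float) for w in exemplars]
    activations = np.array([np.dot(x, w) for w in ws])
    for j in np.argsort(-activations, kind="stable"):
        if cosine(x, ws[j]) > vigilance:
            return int(j), False      # resonance with an existing category
    return len(ws), True              # all categories failed: create a new one
```

With exemplars `[[2, 2], [1, 0]]` and input `[1, 0.1]`, the first category has the larger
activation but fails a vigilance of 0.95, so the search moves on and resonates with the
second.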

Learning  is  considered  to  be  competitive  since  each  F2  neuron  attempts  to  include  the 
current  input  pattern  in  its  category  code.  The  actual  learning  process,  whereby  the 
current  input  pattern  is  encoded  into  the  network's  memory,  involves  modification  of  the 
bottom-up  and  top-down  LTM  traces  that  join  the  winning  F2  neuron  to  the  feature 
representation  field.  Learning  either  refines  the  code  of  a  previously  established  class, 
based  on  any  new  information  that  is  contained  in  the  input  pattern,  or  initiates  code 
learning  in  a  previously  uncommitted  F2  neuron  [3].  In  either  case,  learning  only  occurs 
when  the  system  is  in  a  resonant  state.  This  property  ensures  that  an  input  pattern  does 
not obliterate information that has been previously stored in an established class. A basic
ART  architecture  was  used  in  prior  recognition  efforts  with  some  success  [12]. 

4.2  Fuzzy  ART 

The  basic  operation  of  “adaptive  resonance”  in  the  standard  ART  is  carried  over  to  the 
fuzzy  ART.  The  basic  equations  which  govern  the  fuzzy  ART  are  based  on  the  equations 
from  the  standard  ART  architecture  where  the  intersection  operator  is  replaced  by  its 
fuzzy  counterpart,  the  minimum  operator.  Several  of  the  operations  are  different, 
however.  The  top-down  and  bottom-up  matching  processes  are  combined,  since  the 
matching  between  input  and  category  is  the  same  in  both  directions. 

An introduction to the mathematics governing the fuzzy ART is given here, based
primarily on Carpenter & Grossberg [5, 6, 7]. The fuzzy hypercube ART described in the
next section utilizes this material, along with modifications and additions.

The fuzzy ART system consists of three layers: the input layer (F0), processing layer
(F1), and output category (F2) layer. Associated between layers F1 and F2 are a set of
bidirectional weights denoted bottom-up, directed from F1 to F2, and top-down, directed
from F2 to F1. The following operations and data structures are associated with each of
these layers:

F0:  $A_i = a_i, \quad i = 1, 2, \ldots, M$  and, optionally,

     $A_i = a^c_{i-M} = (1 - a_{i-M}), \quad i = M+1, \ldots, 2M$   (25a)

where M is the number of input components, with optional complementation, and N is the
number of category nodes.

Note that, if the complement is added to A, the complement-coded inputs are
self-normalized:

$|A| = \sum_{i=1}^{M} a_i + \sum_{i=1}^{M} (1 - a_i) = M$
F1:  $x = (x_1, \ldots, x_{2M})$, where

$x = \begin{cases} A & \text{if } F_2 \text{ is inactive} \\ A \wedge w_J & \text{if the } J^{th} \text{ node is chosen through the choice function (27b)} \end{cases}$   (26a)

The choice is made “final” if the preliminary choice x meets a threshold criterion
called the vigilance test,

$\frac{|x|}{|A|} \ge \rho$, or $|A \wedge w_J| \ge \rho |A|$   (26b)

where $\rho$ is the vigilance parameter. If the vigilance criterion is not met, the
preliminary choice [J] is said to be “reset,” and another choice is made from the set of
active categories in y. If there are no more active categories, a new category is created.

F2:  $y = (y_1, \ldots, y_N)$ with associated weights (LTM) $w_j = (w_{j,1}, \ldots, w_{j,2M})$   (27a)

The category Choice Function $T_j$ is defined as:

$T_j = \frac{|A \wedge w_j|}{\alpha + |w_j|}$   (27b)

where $\alpha > 0$ is a choice parameter, and the norm is defined as

$|p| \equiv \sum_i |p_i|$   (27c)

The category choice is made on the basis of a maximum function,

$T_J = \max_j \{T_j\}$   (27d)

If the choice of category made in (27d) passes the vigilance test of equation (26b), then
the category is accepted and learning of the weights occurs as follows:

$w_J^{new} = \beta (A \wedge w_J^{old}) + (1 - \beta) w_J^{old}$   (27e)

Fast learning is said to occur when $\beta = 1$.
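
The fuzzy ART operations of this section (complement coding, category choice, vigilance
test, and weight learning) can be put together in a short sketch. It is an illustration
under stated assumptions rather than the implementation used in the report: the function
names, the default parameter values, and the tie-breaking rule (lowest index wins) are
assumptions.

```python
import numpy as np

def complement_code(a):
    """Append the complement (1 - a) to a, so that |A| equals len(a)."""
    a = np.asarray(a, dtype=float)
    return np.concatenate([a, 1.0 - a])

def fuzzy_art_step(A, weights, rho=0.75, alpha=0.001, beta=1.0):
    """Present one complement-coded input A to the category list `weights`.

    Returns (J, weights), where J is the index of the category that learned A.
    Inputs are assumed non-negative, so the norm |p| is a plain sum.
    """
    A = np.asarray(A, dtype=float)
    norm_A = A.sum()
    # Choice function T_j for every committed category.
    T = np.array([np.minimum(A, w).sum() / (alpha + w.sum()) for w in weights])
    for J in np.argsort(-T, kind="stable"):
        # Vigilance test: |A ^ w_J| / |A| >= rho.
        if np.minimum(A, weights[J]).sum() / norm_A >= rho:
            w_old = weights[J]
            weights[J] = beta * np.minimum(A, w_old) + (1.0 - beta) * w_old
            return int(J), weights
        # Otherwise category J is reset for this input and the next is tried.
    weights.append(A.copy())          # no category passed: commit a new one
    return len(weights) - 1, weights
```

Starting from an empty category list, the first input always creates category 0. A nearby
input is then absorbed into it, while a distant one fails the vigilance test and creates a
second category.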


4.3  Fuzzy  Hypercube  ART 

Neither of the ART neural architectures described in sections 4.1 and 4.2 performed well
during speaker recognition testing. They generally suffered from poor tolerance to noise.
The Fuzzy ART was therefore modified to improve performance. Several basic ideas
were implemented. One was the current representation of the output categories as
hypercubes. An overall volume parameter bounded each hypercube volume. In order to
provide some noise tolerance, the hypercubes were additionally fuzzified. Several other
basic functions were extended, including the category choice function, the inclusion of
hypervolume limits, and the generalization of the learning algorithm with fuzzy
hypercubes. A general overview of the network layer structure is given first, followed by
a more detailed functional description.

4.3.1  Fuzzy  Hypercube  ART  Structure 

The  fuzzy  hypercube  neural  network  has  seven  layers  of  processing.  Figure  4  shows  their 
interconnection.  Each  of  the  layers  is  briefly  described  below.  One  specific  item  to  notice 
is  that  the  network  is  both  feedforward  and  feedback.  Specific  category  information  is  fed 
back  to  the  Hypothesize  and  Fusion  layers  for  hypothesis  formation,  as  well  as  to  the 
Functional layer in category adjustment and learning. Additionally, the resonate/no
resonate signal is an enable/inhibit signal which effectively cycles the entire network in
processing data sets synchronously.


Input: The input is fuzzified, with optional functional expansion, equations (25a,b).
Transform: Category choice functions are evaluated over active categories.
Fusion: The category choice functions are fused to final ratings.
Hypothesize: A final category rating is chosen as a “hypothesis”; otherwise, a new
category is created.
Test: A vigilance pass/fail test is performed, matching the input to the chosen category.
Functional: Categories are created, hypervolume adjusted, or input learned.
Category: Hypercube feature vectors, and control.


4.3.2  Fuzzy  Hypercube  Differences 

The  fuzzy  hypercube  ART  neural  network  has  several  distinct  differences  from  the  basic 
ART  and  fuzzy  ART.  It  retains  the  basic  data  structures  using  A  and  X  vectors.  The 
concepts  of  bottom-up  and  top-down  match  as  well  as  the  learning  rules  are  very 
different.  The  fuzzy  hypercube  layers  and  the  differences  between  prior  ART 
architectures and the current one will be described in the following sections. A detailed
view of the network is shown in figure 5, where each of the blocks from figure 4 is
broken down to the next level of detail.


4.3.3  Input  Layer 

The  Input  layer  has  several  inputs  and  outputs.  An  enable/disable  set  of  inputs 
effectively  controls  the  resonation  of  the  network.  The  network  is  either  allowed  to 
continue  cycling  through  with  the  current  input  A,  when  a  suitable  category  is  not  found 
by  the  Hypothesize/Test  layers,  or  to  stop  the  current  input  and  enable  the  acceptance  of 
the  next  input  upon  finding  a  suitable  category  (or  creating  a  new  one). 

The Input layer, if enabled by a “no resonate” signal, fuzzifies and optionally expands the
input information. Each input dimension in A is translated into a fuzzy membership
function onto the [0,1] interval, which indicates the degree of absence, by its nearness to
its lower bound, or presence, by nearness to its upper bound. The translation is a
mapping $F = \{f\} \rightarrow [0,1]$.


This operation requires a pre-learning of the maximum and minimum expected values,
$F_{Max}$ and $F_{Min}$, for each individual feature. Each feature i, $f^i_{unscaled}$, is
scaled to $f^i_{scaled}$:

$f^i_{scaled} = \frac{f^i_{unscaled} - F^i_{Min}}{|F^i_{Max} - F^i_{Min}|}$   (28)

The set of scaling coefficients, $|F_{Max} - F_{Min}|$, for each feature [j] can be considered as
weighting factors, determined by some learning function, but in the form of the difference
between two quantities, not absolute values.

The values of $F_{Max}$ and $F_{Min}$ were not determined during normal operation of the
neural network, but off line, and were provided as inputs to the process. They were
experimentally determined from observation of the maximum and minimum values of each of
the features [j]. Note that outlier feature values outside of the given scaling ranges are
normalized to 0.0 or 1.0, indicating no membership or full membership in the feature set,
respectively.
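
The scaling step can be sketched as below, assuming a linear mapping of each feature onto
[0,1] between its pre-learned minimum and maximum, with outliers clipped to the ends of the
interval. The function name `fuzzify` and the example limits are hypothetical.

```python
import numpy as np

def fuzzify(features, f_min, f_max):
    """Scale each feature onto [0, 1] using off-line minimum/maximum values.

    Values below f_min map to 0.0 and values above f_max map to 1.0.
    """
    features = np.asarray(features, dtype=float)
    f_min = np.asarray(f_min, dtype=float)
    f_max = np.asarray(f_max, dtype=float)
    scaled = (features - f_min) / np.abs(f_max - f_min)
    return np.clip(scaled, 0.0, 1.0)
```

For instance, with limits of 0 and 10 for every feature, `fuzzify([5.0, 12.0, -3.0], [0, 0, 0],
[10, 10, 10])` maps the three values to 0.5, 1.0, and 0.0.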
Figure 5: FHANN Functional Diagram


4.3.4  Transform  Layer 

Membership  values  from  the  Input  layer  are  passed  to  the  Transform  layer,  where  they 
generate  two  membership  functions  for  each  active  category  node,  a  “Degree  of 
Inclusion”  (DOI),  and  a  “Degree  of  Perfect  Match”  (DPM).  These  memberships  together 
give  an  indication  of  the  degree  to  which  the  input  matches  each  feature  category 
hypercube.  The  development  of  memberships  is  done  through  a  fuzzy  procedure  (see 
Eqs.  31-34). 

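The choice function (Eq. 27b) has been expanded by Carpenter and Gjaja [8] to
Choice-by-difference. Simpson [25] develops a membership function which measures the degree
to which an input A fits within the hypercube defined by Eq. 33. He defines a function
$b_j$, which approaches 1 as the point gets nearer to the hypercube,

$b_j(A_h, V_j, W_j) = \frac{1}{n} \sum_{i=1}^{n} \left[ 1 - f(a_{hi} - w_{ji}, \gamma) - f(v_{ji} - a_{hi}, \gamma) \right]$   (29)

where $v_{ji}$ and $w_{ji}$ are the minimum and maximum points of hypercube j in dimension i,
$\gamma$ controls how quickly the membership falls off with distance, and f() is the ramp
function,

$f(x, \gamma) = \begin{cases} 1 & \text{if } x\gamma > 1 \\ x\gamma & \text{if } 0 \le x\gamma \le 1 \\ 0 & \text{if } x\gamma < 0 \end{cases}$   (30)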
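
As a rough illustration of the ramp-based membership function attributed to Simpson [25],
the sketch below averages, over all dimensions, one minus the ramped distances above the
maximum point and below the minimum point. The names `ramp` and `membership` and the
default sensitivity value are assumptions made for this example.

```python
import numpy as np

def ramp(x, gamma):
    """Two-parameter ramp: x*gamma limited to the interval [0, 1]."""
    return np.clip(np.asarray(x, dtype=float) * gamma, 0.0, 1.0)

def membership(a, v, w, gamma=4.0):
    """Degree to which point a fits the hypercube with min point v, max point w."""
    a, v, w = (np.asarray(t, dtype=float) for t in (a, v, w))
    return float(np.mean(1.0 - ramp(a - w, gamma) - ramp(v - a, gamma)))
```

A point inside the hypercube has membership 1.0; the value decreases as the point moves
away from the hypercube, at a rate set by `gamma`.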


The  choice  function  is  generalized  to  a  hypercube  match.  The  choice  function  is  defined 
by  two  related  nonlinear  functions,  the  degree  of  inclusion  and  the  degree  of  perfect 
match,  which  are  developed  in  parallel,  and  combined  by  a  fusion  function. 


Degree  of  Inclusion. 


The degree of inclusion (DOI) function measures the level to which each dimension of
the input Aj is inside a category hypercube. DOI is a trapezoidal membership function
which gives full membership whenever an element of Aj is included in a category, and less
than full membership outside, depending on the distance to the hypercube. Figure 6
describes the shape of the membership function.


The membership for DOI, $\mu_j^{DOI}(x)$, is defined for each dimension of a hypercube Hj:


Hj  = 


(31) 
1.0 +  (»:-/»])

0 


if  X  >  hj  or  X  <  h^j 
if  h]>x>h] 


(32) 


if  h]<x<h] 
if  x>  hj  or  x<  h'j 


Usually,  \h]  -hj\  =  \hj  -h^  to  evenly  flizzify  the  hypercube.  The  overall  membership 
function $\mu^{DOI}(x)$ is the sum of the individual memberships:

$\mu^{DOI}(x) = \sum_j \mu_j^{DOI}(x)$   (33)

Degree of Perfect Match. The measure of the distance from the mean of each dimension
of Hj is defined as the degree of perfect match (DPM). The DPM is a similarity relation
between the input x and an individual category. The dissimilarity is defined as the
difference between the value x and the mean of the category, $m_j$:

$Dissimilarity_j \equiv |x - m_j|$   (34a)

The similarity is the complement of the dissimilarity,

$Sim_j = \overline{Dissimilarity_j} = 1 - |x - m_j|$   (34b)

The membership for DPM, $\mu_j^{DPM}(x)$, is defined for each dimension of a hypercube Hj
and is derived as follows. The mean of each dimension is


and the membership function for each dimension is


■'  ^  [  l-|x- P(x)z/  Wj 

where  P(x) ,  the  possibility  of  vigilance  is  defined  as. 


(34d) 
(34e)


The overall membership function $\mu^{DPM}(x)$ is the sum of the individual values,

$\mu^{DPM}(x) = \sum_j \mu_j^{DPM}(x)$   (34f)


4.3.5  Fusion  Layer 

The DOI and DPM from the Transform layer, as well as certain feedback category
information, comprise the input to the Fusion layer. The fusion of the membership
functions for degree of inclusion and degree of perfect match is done with a dynamic
weighting and normalization of the two functions. The dynamic weighting is done to
compensate for low DOI at the start of a matching process:

=  +  (35a) 


where 

kj  =  mm{k^  *  A^C,l) 

=  k^ 

and  NC,  the  node  constant,  is  a  dynamic  weighting  function  defined  as: 


(35b) 


$NC(j) = \begin{cases} 0.65, & N_c = 1 \\ 0.85, & N_c = 2 \\ 0.95, & N_c = 3 \\ 1.00, & N_c > 3 \end{cases}$   (35c)


4.3.6  Hypothesize  Layer 

The  inputs  from  the  Fusion  layer  form  a  number  of  potential  hypotheses  from  which  a 
single  hypothesis  is  chosen.  The  hypothesis  is  formed  by  a  maximum  over  all  the  input 
possibilities. 


$\begin{cases} \text{Winner},\ Q = \max_k \{R^k\} & \text{if } R^k \text{ is active and } R^k > 0 \\ \text{NoWinner} & \text{if } R^k \text{ is inactive over } 0 < k < n \end{cases}$   (35d)


The  resultant  hypothesis  of  Winner  is  passed  with  the  winning  category  node  to  the  Test 
layer,  while  the  No  Winner  Hypothesis  is  passed  back  to  the  Input  layer  to  halt  resonation 
of  the  network,  as  well  as  to  create  a  new  category  node  for  the  current  input  A. 
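
The hypothesis selection can be sketched as a maximum over the active, positive ratings. The
function name `hypothesize` and the use of `None` to represent the No Winner hypothesis are
assumptions made for this illustration.

```python
def hypothesize(ratings, active):
    """Pick the winning category index, or None if there is no winner.

    ratings: fused rating R^k for each category node k.
    active:  True for each node still allowed to compete for this input.
    """
    candidates = [(r, k) for k, (r, on) in enumerate(zip(ratings, active))
                  if on and r > 0]
    if not candidates:
        return None                     # No Winner: a new category is needed
    best_rating, best_k = max(candidates, key=lambda item: item[0])
    return best_k
```

For example, with ratings `[0.4, 0.9, 0.7]` and the second node inhibited, the third node is
chosen as the hypothesis.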
4.3.7  Test  Layer

The Test layer performs the vigilance test on the current input A and the category input
hypothesis. The vigilance test, in the standard and fuzzy ART, is a vector matching
process as shown in Eq. 26b. In the fuzzy hypercube ART, the vigilance test is a general
test for category hypercube membership. The test is performed using a modified form of
Eq. 26b,


I^L  adi 

where $\rho_{adj} = \rho\, g(n)$, $g(n)$ is the vigilance adjustment function, and n is the
number of times a category is visited.

The  Test  layer  has  several  outputs  depending  upon  the  result  of  the  vigilance  test.  If  a 
category  passes  the  test,  the  Input  layer  is  signaled  to  halt  resonation  of  the  network,  and 
that category is passed to the Functional Layer. Additionally, the category layer is
re-enabled for all nodes to compete in hypothesizing and testing of the Fusion and
Hypothesis layers.

In  the  case  when  a  category  fails  the  vigilance  test,  the  Input  layer  is  signaled  to  continue 
resonation  and  hence  block  any  input  until  either  a  category  is  matched  or  a  new  one  is 
created.  Additionally,  the  category  which  failed  the  vigilance  test  is  prohibited  from 
competing  with  the  current  input  until  either  another  category  passes  the  test,  or  a  new 
category  is  created. 

4.3.8  Functional  Layer 

The  Functional  layer  is  a  series  of  services  performed  on  the  final  Category  layer.  These 
services  are:  Hypervolume  Measure,  Hypervolume  Test,  Hypervolume  Adjust, 
Hypercube  Learning,  and  Hypercube  Creation. 

Hypervolume Measure. The hypervolume hv is calculated by the product of the LTM
weights as below:

$hv = \prod_{i=1}^{N} (W_i - V_i)$   (36)

Hypervolume Test. The overall hypervolume of each hypercube is maintained within
bounds in order to keep the hypercubes from expanding to infinite volume. The limit is
essentially a bound for learning in the network. The volume parameter is defined as
follows:

$\prod_{i=1}^{N} (W_i - V_i) < volume$  or  $N\Delta < volume$   (37)

The value N is the number of input nodes and $\Delta$ the hypervolume per node. The
hypervolume limit testing and adjustment is necessary since the dimensions of a
hypercube are not otherwise constrained, unlike the case of fuzzy ART, where the weights
can only decrease. In [25], the hypervolume limits on the categories are limited to the
unit hypercube as follows:

$\sum_{i=1}^{N} (\max(w_{ji}, a_{hi}) - \min(v_{ji}, a_{hi})) \le \Theta, \quad 0 \le \Theta \le 1$

The problem with the hypervolume test above is that it cannot easily accommodate
“noisy” hypercubes in the category layer. Since this is a problem with the basic fuzzy
ART, the limit must be changed to allow for noisy data. In the case of the fuzzy
hypercube ART network, the hypercube volume is constrained to be less than a maximum
limit, $hv_{max}$:

$hv < hv_{max}$   (39)

An additional parameter is defined, the hypercube dimension, $hd_{max}$, assuming equal
size in each dimension:

$hd_{max} = \sqrt[N]{hv_{max}}$   (40)

Hypervolume Adjust. If the limit $hv_{max}$ is exceeded, the entire hypervolume is adjusted
to maintain inequality (Eq.42b). The excessive volume $\Delta hv$ is found from the current
hypervolume, hv, by the following:

$\Delta hv = \begin{cases} (hv - hv_{max})/N & hv > hv_{max} \\ 0 & \text{otherwise} \end{cases}$   (41)

where N is the input dimensionality. The hypervolume of the current category [J] must
be adjusted whenever $\Delta hv > 0$ by

$W_{Ji}^{new} = \max\{(W_{Ji}^{old} - \Delta hv), 0\}$

$V_{Ji}^{new} = \min\{(V_{Ji}^{old} + \Delta hv), 1\}$

This operation brings the hypervolume of each selected category within the value of
$hv_{max}$.
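
The Hypervolume Measure, Test, and Adjust services can be sketched together as follows. The
sketch assumes that the adjustment lowers every maximum point W and raises every minimum
point V by the same per-dimension excess, limited to the [0,1] interval, as the text
describes. The function names and the example values are hypothetical.

```python
import numpy as np

def hypervolume(v, w):
    """Product over all dimensions of (max point - min point)."""
    return float(np.prod(np.asarray(w, dtype=float) - np.asarray(v, dtype=float)))

def adjust_hypervolume(v, w, hv_max):
    """Shrink the hypercube [v, w] if its hypervolume exceeds hv_max.

    Returns the (possibly adjusted) v and w and the excess per dimension.
    """
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    hv = hypervolume(v, w)
    delta = (hv - hv_max) / v.size if hv > hv_max else 0.0
    if delta > 0.0:
        w = np.maximum(w - delta, 0.0)   # pull the maximum point inward
        v = np.minimum(v + delta, 1.0)   # push the minimum point inward
    return v, w, delta
```

For a two-dimensional hypercube from (0.1, 0.1) to (0.9, 0.9) and a limit of 0.44, the
hypervolume of 0.64 exceeds the limit by 0.2, so each bound moves inward by 0.1 and the
adjusted hypervolume is 0.36.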


Hypercube Learning. The inclusion of input $A_i$ into the winning category hypercube Bj is
done through a learning algorithm which adjusts the hypercube of category [J]. In
general, each value of $A_i$ selectively adjusts its respective limits in Wj and Vj.

Given an input vector A, a hypercube Bj, and a learning adjustment factor r, learning
is performed on a case-by-case basis for each dimension of A over the entire chosen
category Bj as follows:

Case  1 :  Initialization. 

_  ynew  _  ^ 

'  ,/  //  (43a) 

whenever  and  V“l‘^  > 

Case 2: Input is above Wj

$W_{ji}^{new} = W_{ji}^{old}(1 - r) + r A_i$   (43b)

whenever $A_i > W_{ji}^{old}$ and $A_i < V_{ji}^{old} + hd_{max}$

Case 3: Input is below Vj

$V_{ji}^{new} = V_{ji}^{old}(1 - r) + r A_i$   (43c)

whenever $A_i < V_{ji}^{old}$ and $A_i > W_{ji}^{old} - hd_{max}$

Case 4: Input is within Bj.

Whenever $A_i > V_{ji}^{old}$ and $A_i < W_{ji}^{old}$:


4a)  Input  is  closer  to  W 

Wr=W;;‘'(}-r)-rA, 


whenever Wjf  -  4  >  4  -Vf 


4b)  Input  is  closer  to  V 

Vr  =  Vf(\-r)  +  rA, 

whenever  4  -  Vjf  >  Wf  -  4 


Case  5: 


v;r  =  A,  -  Wf  -hd, 
whenever  A,  >  W, 


max 

old 

JI 


and 


Case  6: 


wr=vf -A,  +hd. 


whenever  Aj  <Vjj 


max 

old 


A  <Vj1‘‘+hd„ 


and  A,>Wj;‘’-hd„ 


(43d) 


(43e) 


(43f) 


(43g) 


Hypercube Creation. The creation of a hypercube requires that the overall hypervolume
limit is adjusted through the hypercube dimension, $hd_{max}$, which depends on the number
of categories in the network, N, from equation (40).

4.3.9  Category  Layer 

The  Category  layer  consists  of  a  set  of  complex  neurons  with  associated  states  and  LTM 
weight  values  which  describe  them.  The  LTM  weights  are  associated  with  the  min-max 
feature  hypercube  representation  of  the  associated  J-categories  defined  by  Simpson  [25]. 
Each  hypercube  category  C  is  a  fuzzy  cluster  defined  by: 

J  =  IX-,N,„  (44) 

s[0,l] 

where $N^c$ is the count of adjustments, $T^c$ is the confidence and $S^c$ is the state of
category [j]. Bj is the hypercube representation of category j, Vj is the minimum point, Wj
the maximum point, and $N_{max}$ the total number of categories.
4.4  Category  Merge

A  global  merge  is  defined  as  the  combination  of  cluster  classes  produced  by  the  neural 
network  which  are  very  “close”  to  one  another.  This  operation  is  performed  outside  of 
the  neural  network  processing  and  does  not  affect  any  of  the  internal  operation  of  the 
network.  It  does,  however,  utilize  detail  parameters  generated  by  the  network,  and  hence 
can  be  considered  a  higher  order  operation  of  the  network  which  is  bound  to  its  operation. 
This process also occurs over time between NN cycles and can be considered a
long-term averaging process.

4.4.1  Merge  Parameters 

There  are  two  measures  which  are  used  to  indicate  whether  a  global  merge  is  to  take 
place: 

a)  Volume  difference  between  hypercube  categories 

b)  Magnitude  of  rating  R  from  equation  (41)  between  two  categories. 

4.4.2  Merge  Criteria 

A  function  is  defined  which  performs  the  category  merge.  First,  the  merge  parameters 
are  obtained  over  all  possible  different  pairs  of  the  current  categories  defined.  Next,  the 
merge  criteria  are  applied  and  used  to  partition  the  current  categories  into  a  final  set  of 
categories  which  is  compacted  using  the  criteria.  Note  that  the  compacting  occurred  very 
rarely  during  testing. 

The  criteria  are  expressed  in  terms  of  acceptance/rejection  regions  in  the  volume 
difference/rating  mapping. 

$0.0 < \Delta vol(c1,c2) < 1.1$  and  $R(c1,c2) > 1.00$  OR

$1.1 < \Delta vol(c1,c2) < 1.5$  and  $R(c1,c2) > 1.00$  OR   (45)

$1.5 < \Delta vol(c1,c2) < 1.75$  and  $R(c1,c2) > 1.40$

These  were  experimentally  derived  and  were  only  used  to  evaluate  the  concept  of  global 
clustering  criteria  within  the  context  of  the  hypercube  structure. 
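
The acceptance regions of Eq. (45) can be expressed as a simple predicate. The sketch below
assumes the thresholds read as three (lower volume difference, upper volume difference,
minimum rating) triples, with 1.75 taken as the upper limit of the last region; the names
`MERGE_REGIONS` and `should_merge` are hypothetical.

```python
# (lower bound on volume difference, upper bound, minimum rating R)
# Values as read from Eq. (45); treat them as assumptions.
MERGE_REGIONS = [
    (0.0, 1.1, 1.00),
    (1.1, 1.5, 1.00),
    (1.5, 1.75, 1.40),
]

def should_merge(delta_vol, rating, regions=MERGE_REGIONS):
    """True if the category pair falls inside any acceptance region."""
    return any(lo < delta_vol < hi and rating > r_min
               for lo, hi, r_min in regions)
```

A pair with a volume difference of 1.6 is merged only if its rating exceeds 1.40, while a
pair with a volume difference of 2.0 is never merged.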


4.5  Initialization 

The  initialization  is  performed  on  the  network  as  follows. 

1.1 Enable all categories, set the count to zero, and set the confidence to “none”.

$N^c = 0, \quad T^c = none, \quad S^c = enabled$   (46a)

1.2 Set all categories

$V_{ji} = 1, \quad W_{ji} = 0, \quad$ for $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, N_{max}$   (46b)
5.  TEST  METHODOLOGY


5.1  Test  Data 

There were two data sets used for the formal testing of the system, the Switchboard [27]
and the Greenflag [28].


            Spkr 1  Spkr 2  Spkr 3  Spkr 4  Spkr 5  Spkr 6  Spkr 7  Spkr 8  M/F
Set 1         02      15      38      46      62      81      28      33    6/2
Set 2         04      41      72      02      15      38      46      62    5/3
Set 3         05      23      27      42      56      59      17      32    4/4
Set 4         15      38      46      62      81      28      33      76    7/1
Set 5 (1)     04      41      72      02      15      38      46      62
Set 5 (2)     32      35      65      90      39      82                    10/6
Set 6 (1)     15      38      46      66      04      41      72      02
Set 6 (2)     65      90      39      82
Set 7         57      17      32      35      65      90      39      82    5/3
Set 8         72      02      15      38      46      62      81      28    5/3
Set 9         90      39      82      04      41      72      02      15    4/4

Table 1: Switchboard 95 Test Set to Actual Speaker Reference


            Spkr 1  Spkr 2  Spkr 3  Spkr 4  Spkr 5  Spkr 6  Spkr 7  Spkr 8
Set 1        ccz     ccv     cdi     cdk     ccd     cch     cdn     cdt
Set 2        cdw     clp     cfi     cel     cev     cfs     cfu     cfec
Set 3        cfi     ccv     cdi     cdk     ccd     cin     cdn     cdt
Set 4        cga     cfp     cfi     cel     cev     cfs     cfu     cfx
Set 5        cgm     chs     ckd     chc     cdk     chg     cch     ccz
Set 6        cgp     chs     ccd     chc     chy     chg     cfu     cfx
Set 7        cgx     chs     ccd     chc     cdk     chg     cfii    ccz
Set 8        chc     cel     cdc     cgx     ccd     cin     cdn     cdt
Set 9        chj     chs     chn     cii     cin     chg     cif     cik
Set 10 (1)   ccz     ccv     cdi     cdk     ccd     cch     cdn     cdt
Set 10 (2)   chj     chs     chn     cii     cin     chg     cif     cik
Set 11       chy     cel     cji     cgx     ccd     cin     cdn
             ceb     cdv     cen     cge     cif     cic
Set 13       ckc     chs     ckb     cii     cin     chg     cif     cik
Set 14 (1)   ccz     ccv     cdi     cdk     ccd     cch     cdn     cdt
Set 14 (2)   chj     chs     chn     cii
Set 15 (1)   cga     cfp     cfi     cel     cev     cfs     cfu     cfx
Set 15 (2)   cgm     chs     chc     chg

Table 2: Greenflag Test Set to Actual Speaker Reference


5.1.1  Switchboard  data  set. 

The Switchboard data were grouped into sets of 8, 12, and 16 speakers. The actual
breakout of the speakers is shown in Table 1. The vertical entries are the 9 speaker sets
consisting of the actual speakers given by the file numbers of individual speakers in the
data set. Additionally, the numbers of male/female speakers are given in the last column.
For a detailed description of the Switchboard database see [27].

5.1.2  Greenflag  data  set.

The Greenflag test data were organized in the same way as the Switchboard data, except that
there are 15 sets. The speaker IDs in each set are three-letter combinations, all beginning
with a “c”. See Table 2, Greenflag Test Set to Actual Speaker Reference, for the breakout.
For a detailed description of the Greenflag database see [28].

5.2  Test  Conditions 

The  subsystems  of  Feature  Processing,  Feature  Preprocessing,  and  Neural  Network 
Classifier  were  tested  using  the  test  data  described  in  section  5.1.  There  were  a  number 
of  fixed  and  varied  parameters  corresponding  to  specific  subsystems  as  given  below. 

All  the  below  parameters  are  specifically  related  to  the  hypercube  network.  The  basic 
ART  and  fuzzy  ART  have  different  parameters  and  are  so  indicated  below. 

a)  Feature  Processing:  fixed  parameters 
Mel  Cepstral  Parameters  =13 
Reflection  Coefficients  =12 
LPC  Cepstral  Coefficients  =  7 
Correlation  Order  [Ojo„=  13] 

Number of LPC poles [p = 14]
Number of Mel bands [Nbands = 12]

Max/Min  values  of  Features  [see  Table  3] 


fO 

n 

f3 

f4  . 

fs 

f6  \ 

f7  . 

f8 

fio 

m 

n2 

LPC  cepstrum  Maximum 

9||| 

0 

m 

Mel  cepstrum  Maximum 

71.3 

6.9 

-1.5 

9.5 

6.6 

2.7 

■ 

Reflection  Maximum 

in 

LPC  cepstrum  Minimum 

-0.1 

Mel  cepstrum  Minimum 

2.7 

4.5 

5.4 

3,6 

-1.5 

SI 

-0.6 

Reflection  Minimum 

■1 

Table 3: Maximum/Minimum Values of Features


b) Signal Segmentation: variable parameters
Total Number of Segments of Voiced Speech Processed
Average Time per Voiced Speech Segment
Minimum Time per Voiced Speech Segment

c) Signal Segmentation and Voiced/Unvoiced: fixed parameters
Number of samples per segment [Nsam = 128]
AMDF fraction of samples per chunk [0.75]
Minimum and maximum pitch [min = 1.9, max = 18.0]
Minimum and maximum zero crossing frequency [min = 0.6, max = 5.0, in kHz]
Minimum voiced energy threshold [= 1000]
d)  Feature  Preprocessing 

Rule  base  for: 

IF (Proportional # matched segments is N1)
AND (Proportional # linked segments is N2)
THEN (Hypothesis Truth that segment S represents a valid speaker is V)

e)  Neural  Network:  fixed  parameters 

e1) Fuzzy Hypercube/Fuzzy ART:
Maximum number of attributes in a pattern [NN_MAXATTR = 50]
Maximum number of opinions per pass [NN_MAXOPINIONS = 2]
Maximum number of classes that may be formed [NN_MAXCLASS = 50]
Lower limit, upper limit initialization value [LL_Init = 1.0, UL_Init = 0.0]
e2) Basic ART:
Maximum number of attributes in a pattern [NN_MAXATTR = 200]
Maximum number of opinions per pass [NN_MAXOPINIONS = 2]
Feedback from top layer [NN_TOPDOWN_FEEDBACK = 0.8]
Maximum number of pattern identifiers [NN_MAXID = 20]
Degree of functional expansion [NN_FUNC_EXPAND = 10]
Lower limit, upper limit initialization value [LL_Init = 0.0, UL_Init = 1.0]

f)  Neural  Network:  variable  parameters 

Vigilance 

Maximum  Hypervolume 

g)  Overall  System  variable  data  &  parameters 

Test  Data  Sets 
Number  of  Speakers 

General  ART  vs  Fuzzy  ART  vs  Fuzzy  Hypercube  ART 
Number of correct & incorrect classifications per Test Set

5.3  Test  Results 

The  test  data  in  5.1  were  applied  according  to  the  test  conditions  of  section  5.2,  and  the 
results  are  reported  in  this  section.  There  were  several  parametric  tests;  measurements 
were  made  on  each  test  run  in  the  following  sections. 

5.3.1  Feature  Processing 

Features were analyzed for two basic characteristics: separability and maximum/minimum
values. Separability was observed using the XGOBI visualization tool, which allows a
multidimensional viewing of the features and their clustering ability. The max/min values
were determined from a basic test set not included in the test results.




5.3.2  Neural  Network 

The following are parameters which were measured in the neural network during the testing:
Total Number of Actual Speakers Correctly Identified (i1), >= 1 class per node
Total Number of Invalid HCNs generated (i2)
Total Number of Spurious HCNs generated (i3)
Total Number of data sets in Spurious HCNs (i4)

The  value  of  i,  is  a  count  of  the  correct  number  of  HCN’s  generated  by  the  NN  which 
corresponds  to  real  speakers.  This  gives  a  number  of  the  correct  number  of  categories 
generated,  independent  of  the  number  of  data  sets  presented  to  the  network.  The  value  of 
12  is  a  count  of  the  number  of  HCN’s  generated  by  the  NN  which  are  in  addition  to  the  set 
il- 

{HCN)=  2;{/,+4) 

allHCN's  ^47^ 

where HCN is the set of hypercube category nodes generated during a complete test run, i1
is the count of correct HCNs and i2 is the count of incorrect HCNs. Table 4 and Table 6
both display the results of i1 as a function of the vigilance parameter and the maximum
hypervolume within a small range of values.

The values of i3 and i4 are the number of spurious nodes generated and the count of data
sets in the spurious nodes. These values do not affect the values of correct/incorrect
classification, since spurious nodes are by definition nodes with only one or two
entries. Table 5 and Table 7 display the spurious category creation in the network
as a function of vigilance parameter and maximum hypervolume, again within the same
small range of values.
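
The four counts described above can be tallied from a finished test run with a few lines of code. The following is only a minimal sketch under stated assumptions, not the procedure used in the tests: each HCN is assumed to be available as a list of the true speaker labels of the data sets assigned to it, a node with at most two entries is treated as spurious (as defined above), and a non-spurious node is attributed to its majority speaker. The function name and the input layout are hypothetical.

```python
from collections import Counter


def tally_hcn_counts(nodes, spurious_max=2):
    """Return (i1, i2, i3, i4) for one test run.

    nodes: list of lists; each inner list holds the true speaker labels
           of the data sets that fell into one HCN (assumed layout).
    """
    i3 = 0                 # number of spurious nodes
    i4 = 0                 # number of data sets held in spurious nodes
    kept = 0               # number of non-spurious nodes
    speakers_seen = set()  # speakers represented by at least one kept node

    for labels in nodes:
        if len(labels) <= spurious_max:
            i3 += 1
            i4 += len(labels)
            continue
        kept += 1
        # Assumption: a node "belongs" to its majority speaker.
        majority_speaker, _ = Counter(labels).most_common(1)[0]
        speakers_seen.add(majority_speaker)

    i1 = len(speakers_seen)  # one correct HCN per real speaker found
    i2 = kept - i1           # HCNs generated in addition to the set i1
    return i1, i2, i3, i4
```

With this layout, two non-spurious nodes dominated by the same speaker contribute one count to i1 and one to i2.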

Summarized  test  results  for  the  fuzzy  hypercube  neural  network  performance  are  shown 
in  Table  5. 


Test Data Set for     Total Number of     Total Voiced           Overall Correct
8 Speakers            Speakers in Test    Speaking Time (hrs)    Classification (%)

Switchboard May 95    26                  2.69                   69.7
Greenflag             41                  2.96                   70.3

TABLE 4: Test Results for 8-Speaker Group


5.3.3  Overall  System 

Parameters  which  are  a  measure  of  the  overall  system  are  given  in  this  section. 
Number  of  correct  and  incorrect  classifications  per  Test  Set 
Total  time  per  Test  Set 
The average overall percent correct classification is defined by:

\[ P_c = \frac{1}{M} \sum_{\text{over } M \text{ test sets}} \frac{TD2_m}{TD2_m + TD3_m} \tag{48} \]

The term inside the summation of Eq. 48 is the mean of each individual test performed,
while the exterior summation averages all the average classification fractions.
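
As an illustration of the averaging in Eq. 48, the sketch below computes the mean of the per-test-set classification fractions. It assumes that TD2 and TD3 are the counts of correct and incorrect classifications for a test set, which is how they appear to be used here; the function name and the input format are hypothetical.

```python
def average_fraction_correct(test_sets):
    """Average the per-test-set fraction TD2 / (TD2 + TD3) over M test sets.

    test_sets: list of (td2, td3) pairs, assumed to be the correct and
               incorrect classification counts of each test set.
    """
    if not test_sets:
        raise ValueError("at least one test set is required")
    fractions = [td2 / (td2 + td3) for td2, td3 in test_sets]
    return sum(fractions) / len(fractions)
```

Multiplying the returned value by 100 gives the percentage form used in the result tables. Each test set carries equal weight in this average, regardless of how many classifications it contains.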

A series of tests which used the value of Eq. 48 was performed. First, two basic tests
were run to evaluate the effects of the vigilance and hypervolume limit on Pc. These are
shown in Tables 3 and 4. The absolute minimum and maximum values obtained
during the entire test period are shown in relation to the mean value, which is plotted
against the vigilance and hypervolume limit values.

The counts of correct (C) and incorrect (I) classifications are related to the neural
network values i1-i4, but were chosen visually from these sets as the values which
provided the greatest number of correct classifications per HCN. Automating this would
require a simple program that chooses the distinct nodes by taking the nodes with the
maximum number of entries as the correct nodes. Also, the totals generated under the
neural network required additional data analysis and speaker truth.
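
The simple program mentioned above could look like the following sketch. It is one possible reading, not the method used in the tests: nodes and speakers are paired greedily, largest count first, so that each speaker is given one distinct node, and every entry outside its speaker's chosen node counts as incorrect. The input layout (a mapping from a node identifier to the true speaker labels of its entries) and all names are assumptions.

```python
from collections import Counter


def choose_distinct_nodes(nodes):
    """Pick one distinct HCN per speaker and count C and I.

    nodes: dict mapping a node id to the list of true speaker labels
           of the data sets assigned to that node (assumed layout).
    Returns (assignment, correct, incorrect).
    """
    # One candidate per (node, speaker) pair, ordered by entry count.
    candidates = sorted(
        (
            (count, node_id, speaker)
            for node_id, labels in nodes.items()
            for speaker, count in Counter(labels).items()
        ),
        key=lambda item: -item[0],
    )

    assignment = {}     # speaker -> chosen node id
    used_nodes = set()
    correct = 0
    for count, node_id, speaker in candidates:
        if speaker in assignment or node_id in used_nodes:
            continue    # keep nodes distinct, one node per speaker
        assignment[speaker] = node_id
        used_nodes.add(node_id)
        correct += count

    total = sum(len(labels) for labels in nodes.values())
    return assignment, correct, total - correct
```

The speaker labels stand in for the speaker truth data that the text notes was needed; without them the choice of correct nodes cannot be automated.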

The  summarized  results  for  the  overall  Switchboard  and  Greenflag  data  sets  taken  for  8 
and  12  speakers  are  given  in  Tables  6  and  7. 

Test Data Set for     Total Number of     Total Voiced            Overall Correct
12 Speakers           Speakers in Test    Speaking Time (mins)    Classification (%)

Switchboard May 95                                                67.25
Greenflag                                                         68.75

TABLE 5: Test Results for 12-Speaker Group


Test Data Set for     Average Number of     Average Number of     Average Number of
8 Speakers            Correct Categories    False Categories      False Categories
                      Generated (8 max)     Generated (8 max)     Deleted per Data Set

Switchboard May 95    6.29                  0.29                  1.86
Greenflag             6.57                  0.23                  5.77

TABLE 6: Fuzzy Hypercube Neural Network Test Results


The  overall  system  test  results  are  shown  in  Table  7.  This  includes  all  speaker  groups. 


Test Data Set for     Total Number   Total Voiced    Overall Correct   Standard    Maximum-
all Speaker Groups    of Speakers    Speaking Time   Classification    Deviation   Minimum
                                     (hrs)           (%)               (avg)       (avg)

Switchboard May 95    26             3               66.9              5.0         14.5
Greenflag             41             3               66.6              6.6         13.4

TABLE 7: Overall System Test Results


6.  DISCUSSION 


6.1  Overall 

The overall testing results are shown in Tables 1, 2, 5 and 6. The results are synopsized
in Table 7, which gives for each data set the standard deviation averaged over all groups,
as well as the maximum-to-minimum value spread averaged over all the groups.
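
The two summary statistics in Table 7 can be computed as in the sketch below. This is an illustration only: it assumes each speaker group contributes a list of percent-correct values, one per test run, and it uses the population standard deviation, since the text does not say whether the population or sample form was used. The function name is hypothetical.

```python
from statistics import mean, pstdev


def summarize_groups(groups):
    """Return (average standard deviation, average max-min spread).

    groups: list of lists; each inner list holds the percent-correct
            values obtained for one speaker group (assumed layout).
    """
    deviations = [pstdev(values) for values in groups]
    spreads = [max(values) - min(values) for values in groups]
    return mean(deviations), mean(spreads)
```

A data set can show a smaller average spread and a larger average standard deviation at the same time, because the spread depends only on the two extreme runs of each group while the standard deviation reflects all runs.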

From these data, it can be seen that Greenflag had a smaller minimum-to-maximum
spread and, with the exception of group #7, all groups appear well behaved. In the
Switchboard case, the spread is much larger in all groups, with group #13 the greatest.
However, the Switchboard data were still better behaved and better clustered, as is
shown by their better (lower) standard deviation value in Table 7.

The performance of the test groups is nearly identical at 67%, but this is for an
8-speaker group maximum.

6.2  Recommendations  for  Future  Research  and  Improvements 

Recommendations for improving the current system, together with additional areas of
research, are presented for the features, classifier, and overall system. The
following are areas that can be investigated to improve the speaker recognition
process:

a) Inclusion of new features. New features can always be added to the feature
processing of the Speaker Recognition System. Some of the features which may be of
use are:

1. Delta Cepstrum

2.  Cepstrum  with  mean  removal 

3.  Log  Energy 

4.  Average  Pitch 

5.  RASTA/PLP 

b)  Expansion  of  input  through  complementation 

c)  Inclusion  of  listener  models 

e)  Inclusion  of  specific  verbal  cue  modeling  for  specific  languages. 
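
Three of the recommendations above (delta cepstrum, cepstrum with mean removal, and expansion of input through complementation) are simple transforms of the feature vectors and can be sketched briefly. The code is a hedged illustration, not part of the tested system: the delta is a plain central difference rather than a regression over a longer window, complementation is assumed to mean complement coding as used in fuzzy ART (appending 1 - x for features already scaled to [0, 1]), and all function names are hypothetical.

```python
def cepstral_mean_removal(frames):
    """Subtract the per-coefficient mean over all frames of an utterance."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in frames]


def delta_cepstrum(frames):
    """Central difference between neighbouring frames, edges replicated."""
    last = len(frames) - 1
    dims = len(frames[0])
    return [
        [
            (frames[min(t + 1, last)][d] - frames[max(t - 1, 0)][d]) / 2.0
            for d in range(dims)
        ]
        for t in range(len(frames))
    ]


def complement_code(features):
    """Append the complement 1 - x of each feature (features in [0, 1])."""
    return list(features) + [1.0 - x for x in features]
```

Complement coding doubles the input dimension, so any hypervolume limit tuned for the original feature vector would need to be revisited if this option were adopted.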